Modeling the Solubility of Organic Compounds in 1-Octanol
Researchers: Michael A Buonaiuto and Andrew SID Lang
Procedure
We extracted all octanol solubility data from the open notebook science challenge solubility database. Octanol Data From ONS.csv
After removing all items marked “DONOTUSE” and obvious outliers, we looked for entries with multiple values.
For solutes with multiple solubility values, we kept only the ones listed in the Abraham paper. [1] If there was no Abraham paper value, we kept the RAEVSKY OA value. In the rare case that there were two Abraham paper (or RAEVSKY OA [2]) values for a single chemspider ID, we kept the higher of the two. This resulted in a file of 261 data points. Refined Octanol Data From ONS.csv.
Descriptors
We used Rajarshi Guha's CDK Descriptor Calculator (v 1.4.6) to calculated the CDK descriptors for all the compounds in our refined data file. We selected the option to add explicit hydrogens. Solubility Data with CDK Descriptors.Once descriptors were calculated, we deleted all columns that had a zero standard deviation, including columns with all "NA."
mydata = read.csv(file="20150728InputForFeatureSelection.csv",head=TRUE,row.names="csid")
ncol(mydata) #get number of columns
[output]
287
[output]
library(caret) ##for feature selection
mydata <- mydata[,colSums(is.na(mydata))<nrow(mydata)] ##remove columns of all entries "NA"
ncol(mydata)
[output]
222
[output]
write.csv(mydata,file="DescriptorsNoNA.csv")
nzv <-nearZeroVar(mydata) ##remove zeros and other small variance columns
mydata <- mydata[, -nzv]
ncol(mydata)
[output]
143
[output]
write.csv(mydata,file="DescriptorsNoNZV.csv")
After the above code was executed, two rows that had multiple NA entries were also deleted in order to remove highly correlated descriptors as part of the feature selection process.
Feature Selection
Feature selection was performed using the following R code that removed highly correlated descriptors:
The following image depicts the chemical space that will be inputted into R to create the random forest model. We used 2 principal components that explain 36% of the variation and clustered the data into two group upon the recommendation from the pam function in R. Using DMax Chemistry assistant, we see the two groups are separated by nHBAcc and TopoPSA with the red cluster corresponding to compounds with zero hydrogen bond acceptors and with a topological polar surface area less than 26.48 angstroms. The red cluster is typified by compounds without hydrogen bond acceptors and with ALogP > 1.56 and with TopoPSA < 26.48; 128 out of 157 compounds match this criteria. The blue cluster is more chemically diverse than the red cluster but even so 75 of the 102 compounds have ALogP < 1.56 and TopoPSA > 26.48 and at least one hydrogen bond acceptor.
Overall the dataset consists of small (typically 100 < MW < 450) druglike compounds (50 compounds have one Lipinski Failure) that are solid at room temperature with 1-octanol solubilities typically between 0.01 M and 1M . Looking at the nAcid descriptor, we see that 49 compounds are carboxylic acids.
Chemical Space - Clustered Using R
Modeling
##Modeling
library("randomForest")
setwd("C:/Users/alang/Dropbox/research/OctanolSolubility/AddressingReviewersComments")
mydata = read.csv(file="DescriptorsFeatureSelected.csv",head=TRUE,row.names="csid")
## 75% of the sample size
smp_size <- floor(0.75 * nrow(mydata))
## set the seed to make your partition reproductible
set.seed(123)
train_ind <- sample(seq_len(nrow(mydata)), size = smp_size)
train <- mydata[train_ind, ]
test <- mydata[-train_ind, ]
mydata.rf <- randomForest(LogS ~ ., data = train,importance = TRUE)
print(mydata.rf)
[output]
Call:
randomForest(formula = LogS ~ ., data = train, importance = TRUE)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 28
Mean of squared residuals: 0.3800402
% Var explained: 62.66
[output]
saveRDS(mydata.rf, file = "OctanolTrainingSet")
training.predict <- predict(mydata.rf,train)
write.csv(training.predict, file = "RFTrainingSetPredict.csv")
test.predict <- predict(mydata.rf,test)
write.csv(test.predict, file = "RFTestSetPredict.csv")
Predicted vs Measured 1-octanol solubility for the test-set
The test-set statistics were: AAE 0.53, MSE 0.44, R2 0.54
Create RF model using all data.
library("randomForest")
setwd("C:/Users/alang/Dropbox/research/OctanolSolubility/AddressingReviewersComments")
mydata = read.csv(file="DescriptorsFeatureSelected.csv",head=TRUE,row.names="csid")
mydata.rf <- randomForest(LogS ~ ., data = mydata,importance = TRUE)
print(mydata.rf)
[output]
Call:
randomForest(formula = LogS ~ ., data = mydata, importance = TRUE)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 28
Mean of squared residuals: 0.3435526
% Var explained: 65.7
[output]
varImpPlot(mydata.rf,main="Random Forest Variable Importance")
saveRDS(mydata.rf, file = "Octanol1")
training.predict <- predict(mydata.rf,mydata)
write.csv(training.predict, file = "FullRFTrainingSetPredict.csv")
The most important descriptors (ALogP, XLogP, TopoPSA, nAtomP, MDEC.23, khs,aaCH, nHBAcc) can be seen on the following plot:
Variable Importance Plot
Note: The correlations between the ALogP and XLogP descriptors = 0.83. If removing all descriptors with correlations >=0.83 was done, then only 70 descriptors would remain instead of 87.
Results
PCA Colored by AE
The following results are an improvement of the model by Abraham and Acree who report a Training-Set R2 of 0.83
RF Out of Bag (Training Set)
Mean Square Error: 0.38
R2: 0.63
Test-Set
Mean Square Error: 0.44
R2: 0.54
Full RF Out of Bag
Mean Square Error: 0.34
R2: 0.66
Full RF Training-Set
Mean Square Error: 0.06
R2: 0.94
Performance Issues
Using DMax to find where the model does well and where is does not do well. The only significant factor is LogS itself. This may be due to the quality of data for low solubilities. When the MSE and R2 values are calculated for compunds both above and below 0.01 M, we see that for compounds with LogS <= -0.2 the MSE is 0.13 and for compounds with LogS > -0.2 the MSE is 0.05. By PC-distance - no relationship, unlike what we saw previously for Abraham solvent coefficients: http://journal.chemistrycentral.com/content/9/1/12
References
1. Abraham, Michael H., and William E. Acree. The solubility of liquid and solid compounds in dry octan-1-ol. Chemosphere 103 (2014): 26-34. doi: dx.doi.org/10.1016/j.chemosphere.2013.10.095
2. Raevsky, O.A., Perlovich, G.L., Schaper, K.-J., 2007. Physicochemical properties/descriptors governing the solubility and partitioning of chemicals in water-solvent gas systems. Part 2. Solubility in 1-octanol. SAR QSAR Environ. Res. 18, 543–578. doi: 10.1080/10629360701430124
Modeling the Solubility of Organic Compounds in 1-Octanol
Researchers: Michael A Buonaiuto and Andrew SID LangProcedure
We extracted all octanol solubility data from the open notebook science challenge solubility database. Octanol Data From ONS.csvAfter removing all items marked “DONOTUSE” and obvious outliers, we looked for entries with multiple values.
For solutes with multiple solubility values, we kept only the ones listed in the Abraham paper. [1] If there was no Abraham paper value, we kept the RAEVSKY OA value. In the rare case that there were two Abraham paper (or RAEVSKY OA [2]) values for a single chemspider ID, we kept the higher of the two. This resulted in a file of 261 data points. Refined Octanol Data From ONS.csv.
Descriptors
We used Rajarshi Guha's CDK Descriptor Calculator (v 1.4.6) to calculated the CDK descriptors for all the compounds in our refined data file. We selected the option to add explicit hydrogens. Solubility Data with CDK Descriptors.Once descriptors were calculated, we deleted all columns that had a zero standard deviation, including columns with all "NA."After the above code was executed, two rows that had multiple NA entries were also deleted in order to remove highly correlated descriptors as part of the feature selection process.
Feature Selection
Feature selection was performed using the following R code that removed highly correlated descriptors:Description of Data
The data that we will use to create our RF models now consists of 259 1-octanol solubility values with 86 descriptors, see the histogram below:Principal Component & Cluster Analysis
Cluster Analysis## determining the best number of clusters library(fpc) library(cluster) mydata = read.csv(file="DescriptorsFeatureSelected.csv",head=TRUE,row.names="csid") asw <- numeric(20) for (k in 2:20) asw[[k]] <- pam(mydata, k) $ silinfo $ avg.width asw [output] [1] 0.0000000 0.7352926 0.4876591 0.3789633 0.4052174 0.4129726 0.4262371 0.4391316 0.4286310 0.3785256 0.3851924 [12] 0.3942103 0.4035164 0.4044112 0.3437240 0.3447198 0.3432463 0.3398725 0.3454049 0.3415452 [output] k.best <- which.max(asw) cat("silhouette-optimal number of clusters:", k.best, "\n") # 2 ##Cluster Analysis mydata = read.csv(file="DescriptorsFeatureSelected.csv",head=TRUE,row.names="csid") mydata <- scale(mydata) # standardize variables # K-Means Cluster Analysis fit <- kmeans(mydata, 2) # 2 cluster solution # get cluster means aggregate(mydata,by=list(fit$cluster),FUN=mean) [output] Too Much To Display [output] # append cluster assignment mydata <- data.frame(mydata, fit$cluster) write.csv(mydata, file = "clusters.csv")Principal Component Analysis
##PCA mydata = read.csv(file="DescriptorsFeatureSelected.csv",head=TRUE,row.names="csid") pc1 <- prcomp(mydata, scale. = T) x <- pc1$x summary(pc1) [output] Importance of components: PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12 Standard deviation 4.0922 3.8658 3.2466 2.5916 2.32777 1.99701 1.72687 1.50001 1.43944 1.43011 1.33050 1.29421 Proportion of Variance 0.1925 0.1718 0.1212 0.0772 0.06228 0.04584 0.03428 0.02586 0.02382 0.02351 0.02035 0.01925 Cumulative Proportion 0.1925 0.3643 0.4854 0.5626 0.62490 0.67074 0.70501 0.73088 0.75469 0.77820 0.79855 0.81780 [output] write.csv(x, file = "PCA.csv")The following image depicts the chemical space that will be inputted into R to create the random forest model. We used 2 principal components that explain 36% of the variation and clustered the data into two group upon the recommendation from the pam function in R. Using DMax Chemistry assistant, we see the two groups are separated by nHBAcc and TopoPSA with the red cluster corresponding to compounds with zero hydrogen bond acceptors and with a topological polar surface area less than 26.48 angstroms. The red cluster is typified by compounds without hydrogen bond acceptors and with ALogP > 1.56 and with TopoPSA < 26.48; 128 out of 157 compounds match this criteria. The blue cluster is more chemically diverse than the red cluster but even so 75 of the 102 compounds have ALogP < 1.56 and TopoPSA > 26.48 and at least one hydrogen bond acceptor.
Overall the dataset consists of small (typically 100 < MW < 450) druglike compounds (50 compounds have one Lipinski Failure) that are solid at room temperature with 1-octanol solubilities typically between 0.01 M and 1M . Looking at the nAcid descriptor, we see that 49 compounds are carboxylic acids.
Chemical Space - Clustered Using R
Modeling
##Modeling library("randomForest") setwd("C:/Users/alang/Dropbox/research/OctanolSolubility/AddressingReviewersComments") mydata = read.csv(file="DescriptorsFeatureSelected.csv",head=TRUE,row.names="csid") ## 75% of the sample size smp_size <- floor(0.75 * nrow(mydata)) ## set the seed to make your partition reproductible set.seed(123) train_ind <- sample(seq_len(nrow(mydata)), size = smp_size) train <- mydata[train_ind, ] test <- mydata[-train_ind, ] mydata.rf <- randomForest(LogS ~ ., data = train,importance = TRUE) print(mydata.rf) [output] Call: randomForest(formula = LogS ~ ., data = train, importance = TRUE) Type of random forest: regression Number of trees: 500 No. of variables tried at each split: 28 Mean of squared residuals: 0.3800402 % Var explained: 62.66 [output] saveRDS(mydata.rf, file = "OctanolTrainingSet") training.predict <- predict(mydata.rf,train) write.csv(training.predict, file = "RFTrainingSetPredict.csv") test.predict <- predict(mydata.rf,test) write.csv(test.predict, file = "RFTestSetPredict.csv")Predicted vs Measured 1-octanol solubility for the test-set
The test-set statistics were: AAE 0.53, MSE 0.44, R2 0.54
Create RF model using all data.
library("randomForest") setwd("C:/Users/alang/Dropbox/research/OctanolSolubility/AddressingReviewersComments") mydata = read.csv(file="DescriptorsFeatureSelected.csv",head=TRUE,row.names="csid") mydata.rf <- randomForest(LogS ~ ., data = mydata,importance = TRUE) print(mydata.rf) [output] Call: randomForest(formula = LogS ~ ., data = mydata, importance = TRUE) Type of random forest: regression Number of trees: 500 No. of variables tried at each split: 28 Mean of squared residuals: 0.3435526 % Var explained: 65.7 [output] varImpPlot(mydata.rf,main="Random Forest Variable Importance") saveRDS(mydata.rf, file = "Octanol1") training.predict <- predict(mydata.rf,mydata) write.csv(training.predict, file = "FullRFTrainingSetPredict.csv")The most important descriptors (ALogP, XLogP, TopoPSA, nAtomP, MDEC.23, khs,aaCH, nHBAcc) can be seen on the following plot:
Note: The correlations between the ALogP and XLogP descriptors = 0.83. If removing all descriptors with correlations >=0.83 was done, then only 70 descriptors would remain instead of 87.
Results
The following results are an improvement of the model by Abraham and Acree who report a Training-Set R2 of 0.83
RF Out of Bag (Training Set)
Mean Square Error: 0.38
R2: 0.63
Test-Set
Mean Square Error: 0.44
R2: 0.54
Full RF Out of Bag
Mean Square Error: 0.34
R2: 0.66
Full RF Training-Set
Mean Square Error: 0.06
R2: 0.94
Performance Issues
Using DMax to find where the model does well and where is does not do well. The only significant factor is LogS itself. This may be due to the quality of data for low solubilities. When the MSE and R2 values are calculated for compunds both above and below 0.01 M, we see that for compounds with LogS <= -0.2 the MSE is 0.13 and for compounds with LogS > -0.2 the MSE is 0.05. By PC-distance - no relationship, unlike what we saw previously for Abraham solvent coefficients:http://journal.chemistrycentral.com/content/9/1/12
References
1. Abraham, Michael H., and William E. Acree. The solubility of liquid and solid compounds in dry octan-1-ol. Chemosphere 103 (2014): 26-34. doi: dx.doi.org/10.1016/j.chemosphere.2013.10.0952. Raevsky, O.A., Perlovich, G.L., Schaper, K.-J., 2007. Physicochemical properties/descriptors governing the solubility and partitioning of chemicals in water-solvent gas systems. Part 2. Solubility in 1-octanol. SAR QSAR Environ. Res. 18, 543–578. doi: 10.1080/10629360701430124